This project categorizes 160K food orders from 100 different pizza shops. It applies a Bag of Words model followed by K-means clustering in several passes: first on the product categories, then on the product names.
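As a rough illustration of the Bag of Words step, the sketch below builds a vocabulary capped at a maximum number of tokens and turns each product name into a count vector. It uses only the standard library; `build_vocabulary` and `vectorize` are hypothetical names, and the notebook's own `Clustering_functions` class may work differently.

```python
from collections import Counter


def build_vocabulary(documents, max_features):
    """Keep the max_features most frequent tokens across all documents."""
    counts = Counter(token for doc in documents for token in doc.split())
    # Rank by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return sorted(token for token, _ in ranked[:max_features])


def vectorize(document, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(document.split())
    return [counts[token] for token in vocabulary]


# Hypothetical sample of already stemmed product names.
names = ["chees pizza", "chicken pizza", "roma pizza", "chees pizza plain"]
vocabulary = build_vocabulary(names, max_features=3)
vectors = [vectorize(name, vocabulary) for name in names]

print(vocabulary)
for name, vector in zip(names, vectors):
    print(name, vector)
```

With a cap of three features, rare tokens such as `roma` and `plain` are dropped, so `roma pizza` is represented by the `pizza` count alone.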
# Import libraries
import pandas as pd
# Import custom functions
from functions.data_preprocess import stopwords_n_stemming
from functions.plotting import plotly_pie_chart
from functions.clustering import Clustering_functions
# Instantiate class
clf = Clustering_functions()
# Read original dataset
original_dataset = pd.read_csv('orderItems.csv')
# Get product name (column 7) and product category (column 11)
testset = original_dataset.iloc[:, [7, 11]].values
corpusname = stopwords_n_stemming(testset[:,0])
corpuscat = stopwords_n_stemming(testset[:,1])
# Creating a pd DataFrame with original index, product category, name and price
dataset = original_dataset.iloc[:, [9]].copy()  # copy so the inserts below do not act on a view
indexx = list(range(len(testset)))
dataset.insert(0, "original index", indexx, True)
dataset.insert(1, "product_category_name", corpuscat, True)
dataset.insert(2, "product_name", corpusname, True)
dataset.head(10)
| | original index | product_category_name | product_name | product_type_price |
|---|---|---|---|---|
| 0 | 0 | gourmet pizza | chicken cordon bleu pizza | 16.95 |
| 1 | 1 | starter | tasti garlic bread marinara sauc mozzarella chees | 4.75 |
| 2 | 2 | beverag | soda | 4.45 |
| 3 | 3 | pizza | tradit plain chees pizza | 8.00 |
| 4 | 4 | side order | onion ring | 4.99 |
| 5 | 5 | specialti pizza | roma pizza | 13.99 |
| 6 | 6 | side order | french fri gravi | 3.50 |
| 7 | 7 | greek specialti | lamb beef gyro platter | 11.99 |
| 8 | 8 | pasta dish | chicken il palio pasta | 8.99 |
| 9 | 9 | specialti pizza | tandori chicken pizza | 15.99 |
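The `stopwords_n_stemming` helper is imported from a local module and its body is not shown. The sketch below illustrates the kind of normalization its output suggests: lowercasing, stripping non-letters, and dropping stopwords. The `normalize` function, the tiny `STOPWORDS` set, and the sample inputs are hypothetical. Stemming is left as a pluggable `stem` argument, because tokens such as `chees` and `specialti` in the table above look like the output of a Porter-style stemmer, which this sketch does not reimplement.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "and", "in", "of", "the", "with"}


def normalize(text, stopwords=STOPWORDS, stem=lambda token: token):
    """Lowercase, keep letters only, drop stopwords, then apply `stem`."""
    tokens = re.sub(r"[^a-z]+", " ", text.lower()).split()
    return " ".join(stem(token) for token in tokens if token not in stopwords)


print(normalize("Lamb and Beef Gyro Platter"))
print(normalize("Onion Rings (6 pc.)"))
```

Passing a real stemmer as `stem` (for example a Porter stemmer from an NLP library) would bring the result closer to the tokens shown in the table.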
# define the plotting package
# Jupyter notebook is able to create interactive plotly figures
plotting_package = 'plotly'
# define if figures will be exported locally
export_graph = True
pizza_df = dataset[dataset['product_category_name'].str.contains(r'pizza')].copy()
nonpizza_df = dataset[~dataset['product_category_name'].str.contains(r'pizza')].copy()
# create a pizza - non pizza pie chart
labels = 'pizzas', 'non-pizza products'
plot_title = 'pizza - non-pizza product distribution'
sizes = [(len(pizza_df)/len(dataset))*100, (len(nonpizza_df)/len(dataset))*100]
fig = plotly_pie_chart(labels, sizes, plot_title, export_graph)
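nclusters_pizza = 30 # number of clusters for pizza products (used for both the category and the name pass)
nclusters_cat_nopizza, nclusters_name_nopizza = 30, 15 # non-pizza products: clusters for the category pass, then for the name pass within each category
max_features = 50 # maximum vocabulary size for the Bag of Words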
# Conduct K-means clustering based on product category
print('K-means clustering: pizza products, by product-category')
cat_y_kmeans, cat_clusternames, pizza_categories_df = clf.complete_clustering(
    pizza_df, 1, nclusters_pizza, max_features,
    'Initial pizza categories', 'predicted_category', plotting_package, export_graph)
# Conduct K-means clustering based on product name
print('K-means clustering: pizza products, by product-name')
name_y_kmeans, name_clusternames, pizza_names_df = clf.complete_clustering(
    pizza_df, 2, nclusters_pizza, max_features,
    'Pizza products', 'predicted_name', plotting_package, export_graph)
# Update the pizza dataframe
pizza_df.insert(4, 'predicted_category', pizza_categories_df['predicted_category'])
pizza_df.insert(5, 'predicted_name', pizza_names_df['predicted_name'])
pizza_df.head(20)
K-means clustering: pizza products, by product-category Complete loss of information will occur for 0.0% of products initial clustering inertia: 2042.09
K-means clustering: pizza products, by product-name Complete loss of information will occur for 0.08% of products initial clustering inertia: 14307.34
| | original index | product_category_name | product_name | product_type_price | predicted_category | predicted_name |
|---|---|---|---|---|---|---|
| 0 | 0 | gourmet pizza | chicken cordon bleu pizza | 16.95 | gourmet pizza | chicken pizza |
| 3 | 3 | pizza | tradit plain chees pizza | 8.00 | pizza | chees pizza plain |
| 5 | 5 | specialti pizza | roma pizza | 13.99 | pizza specialti | pizza |
| 9 | 9 | specialti pizza | tandori chicken pizza | 15.99 | pizza specialti | chicken pizza |
| 11 | 11 | pizza | chees pizza | 12.99 | pizza | chees pizza |
| 13 | 13 | specialti pizza | hawaiian pizza | 17.49 | pizza specialti | pizza |
| 16 | 16 | deep dish pizza | chicago deep dish pizza | 17.00 | deep dish pizza | chees deep dish pizza |
| 18 | 18 | classic new york pizza | famou chees pizza | 6.99 | classic new pizza york | chees famou pizza |
| 19 | 19 | pizza | chees pizza | 17.95 | pizza | chees pizza |
| 25 | 25 | specialti pizza | amalfi pizza | 18.99 | pizza specialti | pizza |
| 28 | 28 | pizza | chees pizza | 13.99 | pizza | chees pizza |
| 32 | 32 | new york style pizza | creat new york pizza | 17.00 | new pizza style york | creat new pizza york |
| 34 | 34 | gourmet pizza | amigo combo climax pizza | 24.99 | gourmet pizza | pizza |
| 40 | 40 | new york style pizza | creat new york pizza | 17.00 | new pizza style york | creat new pizza york |
| 44 | 44 | specialti pizza | meat combo pizza | 16.99 | pizza specialti | pizza |
| 45 | 45 | pizza | chees pizza | 9.75 | pizza | chees pizza |
| 47 | 47 | new york style gourmet pizza | veggi suprem pizza | 19.00 | gourmet new pizza style york | pizza veggi |
| 48 | 48 | new york style pizza | creat new york pizza | 17.00 | new pizza style york | creat new pizza york |
| 53 | 53 | tradit ny pizza | chees pizza | 10.99 | ny pizza tradit | chees pizza |
| 56 | 56 | specialti pizza | tandori chicken pizza | 17.99 | pizza specialti | chicken pizza |
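The log lines above report a "complete loss of information" percentage. The source does not define it. One plausible reading is the share of products whose tokens all fall outside the capped Bag of Words vocabulary, so that their feature vectors are all zeros. The sketch below computes that share under this assumption; `share_without_features` and the sample data are hypothetical.

```python
def share_without_features(documents, vocabulary):
    """Percentage of documents that contain no vocabulary token at all."""
    vocab = set(vocabulary)
    # A document is "lost" when none of its tokens survive the vocabulary cap.
    lost = sum(1 for doc in documents if not vocab.intersection(doc.split()))
    return 100.0 * lost / len(documents)


# Hypothetical stemmed names and a hypothetical capped vocabulary.
names = ["chees pizza", "amalfi", "chicken pizza", "roma pizza"]
vocabulary = ["chees", "chicken", "pizza"]

loss = share_without_features(names, vocabulary)
print(f"Complete loss of information for {loss}% of products")
```

Under this reading, raising `max_features` lowers the percentage, at the cost of a larger and sparser feature matrix.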
# Conduct K-means clustering on product category
print('K-means clustering: non-pizza products, by product-category')
cat_y_kmeans, cat_clusternames, nonpizza_categories_df = clf.complete_clustering(
    nonpizza_df, 1, nclusters_cat_nopizza, max_features,
    'Initial nonpizza categories', 'predicted_category', plotting_package, export_graph)
# Update the nonpizza dataframe
nonpizza_df.insert(4, 'predicted_category', nonpizza_categories_df['predicted_category'])
# Conduct K-means clustering based on product name
nonpizza_df.insert(5, 'predicted_name', '') # create an empty column to be updated
for jj in range(nclusters_cat_nopizza):
    print('K-means clustering: ' + cat_clusternames[jj] + ' products, by product-name')
    # Get the data sub-set for this predicted category
    target_product_cat = nonpizza_df[nonpizza_df['predicted_category'] == cat_clusternames[jj]].copy()
    # Conduct K-means clustering on the product names of the sub-set
    name_y_kmeans, name_clusternames, target_product_names = clf.complete_clustering(
        target_product_cat, 2, nclusters_name_nopizza, max_features,
        cat_clusternames[jj], 'predicted_name', plotting_package, export_graph)
    # Update the nonpizza dataframe (rows are matched on the index)
    nonpizza_df.update(target_product_names)
    del target_product_cat, name_y_kmeans, name_clusternames, target_product_names
K-means clustering: non-pizza products, by product-category Complete loss of information will occur for 3.1% of products initial clustering inertia: 21471.89
K-means clustering: appet products, by product-name Complete loss of information will occur for 4.34% of products initial clustering inertia: 9506.99
K-means clustering: display non product products, by product-name Complete loss of information will occur for 1.18% of products initial clustering inertia: 9808.23
K-means clustering: various products, by product-name Complete loss of information will occur for 10.01% of products initial clustering inertia: 10026.3
K-means clustering: hot sub products, by product-name Complete loss of information will occur for 0.0% of products initial clustering inertia: 2350.28
K-means clustering: beverag products, by product-name Complete loss of information will occur for 0.35% of products initial clustering inertia: 531.15
K-means clustering: sandwich products, by product-name Complete loss of information will occur for 0.1% of products initial clustering inertia: 2450.85
K-means clustering: salad products, by product-name Complete loss of information will occur for 0.04% of products initial clustering inertia: 3211.72
K-means clustering: order side products, by product-name Complete loss of information will occur for 2.34% of products initial clustering inertia: 3366.2
K-means clustering: wing products, by product-name Complete loss of information will occur for 0.0% of products initial clustering inertia: 814.03
K-means clustering: dish pasta products, by product-name Complete loss of information will occur for 1.84% of products initial clustering inertia: 2672.54
K-means clustering: special products, by product-name Complete loss of information will occur for 4.53% of products initial clustering inertia: 3198.57
K-means clustering: dessert products, by product-name Complete loss of information will occur for 2.16% of products initial clustering inertia: 1803.42
K-means clustering: calzon stromboli products, by product-name Complete loss of information will occur for 0.25% of products initial clustering inertia: 807.54
K-means clustering: wing products, by product-name Complete loss of information will occur for 0.0% of products initial clustering inertia: 814.03
K-means clustering: wrap products, by product-name Complete loss of information will occur for 0.0% of products initial clustering inertia: 1077.42
K-means clustering: pasta products, by product-name Complete loss of information will occur for 0.57% of products initial clustering inertia: 1624.8
K-means clustering: calzon products, by product-name Complete loss of information will occur for 0.17% of products initial clustering inertia: 980.4
K-means clustering: steak products, by product-name Complete loss of information will occur for 0.0% of products initial clustering inertia: 280.25
K-means clustering: cold sub products, by product-name Complete loss of information will occur for 0.0% of products initial clustering inertia: 496.5
K-means clustering: side products, by product-name Complete loss of information will occur for 1.17% of products initial clustering inertia: 735.85
K-means clustering: cheesesteak products, by product-name Complete loss of information will occur for 0.0% of products initial clustering inertia: 256.78
K-means clustering: entre products, by product-name Complete loss of information will occur for 1.7% of products initial clustering inertia: 1100.75
K-means clustering: specialti products, by product-name Complete loss of information will occur for 1.33% of products initial clustering inertia: 1068.46
K-means clustering: kid menu products, by product-name Complete loss of information will occur for 1.47% of products initial clustering inertia: 919.65
K-means clustering: hot sandwich products, by product-name Complete loss of information will occur for 0.0% of products initial clustering inertia: 1210.95
K-means clustering: burger products, by product-name Complete loss of information will occur for 0.0% of products initial clustering inertia: 313.99
K-means clustering: grinder products, by product-name Complete loss of information will occur for 0.0% of products initial clustering inertia: 715.62
K-means clustering: sub products, by product-name Complete loss of information will occur for 0.0% of products initial clustering inertia: 757.56
K-means clustering: item popular products, by product-name Complete loss of information will occur for 1.43% of products initial clustering inertia: 189.62
K-means clustering: chicken products, by product-name Complete loss of information will occur for 0.13% of products initial clustering inertia: 471.46
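Each log line above reports an inertia value. In K-means, inertia is the sum of squared distances from every sample to its assigned cluster centre, so lower values mean tighter clusters. The sketch below is a minimal standard-library K-means on Bag of Words count vectors that returns this quantity. It is not the notebook's `Clustering_functions` implementation, which is not shown. The deterministic initialisation and the `name_cluster` rule (join the tokens whose centroid weight is at least 0.5) are assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, n_iterations=20):
    """Plain K-means; returns (labels, centroids, inertia)."""
    # Assumption: start from the first n_clusters distinct vectors.
    centroids = []
    for vector in vectors:
        if list(vector) not in centroids:
            centroids.append(list(vector))
        if len(centroids) == n_clusters:
            break
    labels = None
    for _ in range(n_iterations):
        # Assignment step: attach every vector to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(squared_distance(v, centroids[l]) for v, l in zip(vectors, labels))
    return labels, centroids, inertia


def name_cluster(centroid, vocabulary, threshold=0.5):
    """Hypothetical naming rule: tokens that are prominent in the centroid."""
    return " ".join(t for t, w in zip(vocabulary, centroid) if w >= threshold)


# Hypothetical count vectors over a four-token vocabulary.
vocabulary = ["chees", "chicken", "pizza", "plain"]
vectors = [
    [1, 0, 1, 0],  # chees pizza
    [0, 1, 1, 0],  # chicken pizza
    [1, 0, 1, 1],  # chees pizza plain
    [0, 1, 1, 0],  # chicken pizza
]
labels, centroids, inertia = kmeans(vectors, n_clusters=2)
print(labels, inertia)
print([name_cluster(c, vocabulary) for c in centroids])
```

Because the vocabulary is kept in alphabetical order, names built this way list their tokens alphabetically, which resembles labels such as `pizza specialti` in the tables, although the source does not confirm how cluster names are derived.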
final_df = original_dataset.copy()
final_df.insert(12, 'predicted_category', '')
final_df.insert(13, 'predicted_name', '')
final_df.update(nonpizza_df)
final_df.update(pizza_df)
pizza_df.head(20)
| | original index | product_category_name | product_name | product_type_price | predicted_category | predicted_name |
|---|---|---|---|---|---|---|
| 0 | 0 | gourmet pizza | chicken cordon bleu pizza | 16.95 | gourmet pizza | chicken pizza |
| 3 | 3 | pizza | tradit plain chees pizza | 8.00 | pizza | chees pizza plain |
| 5 | 5 | specialti pizza | roma pizza | 13.99 | pizza specialti | pizza |
| 9 | 9 | specialti pizza | tandori chicken pizza | 15.99 | pizza specialti | chicken pizza |
| 11 | 11 | pizza | chees pizza | 12.99 | pizza | chees pizza |
| 13 | 13 | specialti pizza | hawaiian pizza | 17.49 | pizza specialti | pizza |
| 16 | 16 | deep dish pizza | chicago deep dish pizza | 17.00 | deep dish pizza | chees deep dish pizza |
| 18 | 18 | classic new york pizza | famou chees pizza | 6.99 | classic new pizza york | chees famou pizza |
| 19 | 19 | pizza | chees pizza | 17.95 | pizza | chees pizza |
| 25 | 25 | specialti pizza | amalfi pizza | 18.99 | pizza specialti | pizza |
| 28 | 28 | pizza | chees pizza | 13.99 | pizza | chees pizza |
| 32 | 32 | new york style pizza | creat new york pizza | 17.00 | new pizza style york | creat new pizza york |
| 34 | 34 | gourmet pizza | amigo combo climax pizza | 24.99 | gourmet pizza | pizza |
| 40 | 40 | new york style pizza | creat new york pizza | 17.00 | new pizza style york | creat new pizza york |
| 44 | 44 | specialti pizza | meat combo pizza | 16.99 | pizza specialti | pizza |
| 45 | 45 | pizza | chees pizza | 9.75 | pizza | chees pizza |
| 47 | 47 | new york style gourmet pizza | veggi suprem pizza | 19.00 | gourmet new pizza style york | pizza veggi |
| 48 | 48 | new york style pizza | creat new york pizza | 17.00 | new pizza style york | creat new pizza york |
| 53 | 53 | tradit ny pizza | chees pizza | 10.99 | ny pizza tradit | chees pizza |
| 56 | 56 | specialti pizza | tandori chicken pizza | 17.99 | pizza specialti | chicken pizza |